Within and across sentence boundary language model
Abstract
In this paper, we propose two language modeling approaches, the skip trigram model and the across sentence boundary model, to capture long-range dependencies. The skip trigram model covers more predecessor words of the current word than the normal trigram while requiring the same memory space. The across sentence boundary model uses the word distribution of the previous sentences to calculate the unigram probability, which is applied as the emission probability in the word and the class model frameworks. Our experiments on the Penn Treebank [1] show that each of our proposed models, as well as their combination, significantly outperforms the baseline for both the word and the class models and their linear interpolation. The linear interpolation of the word and the class models with the proposed skip trigram and across sentence boundary models achieves a perplexity of 118.4, while the best state-of-the-art language model has a perplexity of 137.2 on the same dataset.
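The two ideas in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulation, so everything here is an assumption made for illustration: the function names `skip_trigrams` and `history_unigram`, the choice to skip words between the two context words, and the add-alpha smoothing are all hypothetical and not taken from the paper.

```python
from collections import Counter


def skip_trigrams(tokens, skip=1):
    """Yield (far, near, word) triples.

    Assumption: `skip` words are omitted between the far context word and
    the near one, so each entry still stores three words (same memory as a
    normal trigram) but reaches further back. skip=0 gives normal trigrams.
    """
    for i in range(2 + skip, len(tokens)):
        yield tokens[i - 2 - skip], tokens[i - 1], tokens[i]


def history_unigram(previous_sentences, vocab, alpha=1.0):
    """Unigram distribution estimated from the previous sentences.

    Assumption: add-alpha smoothing over a fixed vocabulary; words outside
    the vocabulary are ignored so the result sums to one.
    """
    counts = Counter(
        w for sentence in previous_sentences for w in sentence if w in vocab
    )
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


if __name__ == "__main__":
    tokens = "the cat sat on the mat".split()
    print(list(skip_trigrams(tokens, skip=1)))

    history = [["the", "cat"], ["the", "dog"]]
    vocab = {"the", "cat", "dog", "mat"}
    print(history_unigram(history, vocab))
```

In a full model, counts of such triples would presumably feed a smoothed conditional probability, and the history-based unigram would stand in for the emission probability mentioned in the abstract; both steps are left out of this sketch.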
Similar resources
A study in machine learning from imbalanced data for sentence boundary detection in speech
Enriching speech recognition output with sentence boundaries improves its human readability and enables further processing by downstream language processing modules. We have constructed a hidden Markov model (HMM) system to detect sentence boundaries that uses both prosodic and textual information. Since there are more nonsentence boundaries than sentence boundaries in the data, the prosody mod...
An Automatic Sentence Bou Based on a Structured La
In this paper we describe an automatic sentence boundary detector, which inserts a period (sentence boundary marker) into a word sequence output by a speech recognizer. The state-of-the-art automatic sentence boundary detectors insert a period at a position selected by a word trigram model from among candidates (long pauses) offered by an acoustic model. In contrast, the automatic sentence bound...
The consistency of sentence intelligibility across three types of signal distortion.
PURPOSE To examine the extent to which sentences retain their levels of spoken intelligibility relative to other sentences in a set (the sentence effect) across different types of signal distortion. METHOD The Central Institute for the Deaf (CID) sentences were rendered difficult to understand through the addition of broadband noise. These intelligibility data were compared with those from pr...
An Investigation on the Relationship between the Grammatical Competence of Young Iranian English Translation Students and their Ability to Translate from English to Farsi
Today, everything has changed and this has brought a need for learning a second language. Most countries across the world use English as their second/foreign language and the fundamental part of this process is grammar, i.e., the combination of sound, structure, and meaning system of language. A sentence can be composed of several words, clauses, as well as grammatical rules. These grammat...
A hybrid approach for Urdu sentence boundary disambiguation
Sentence boundary identification is a preliminary step in preparing a text document for Natural Language Processing tasks such as machine translation, POS tagging, and text summarization. We present a hybrid approach for Urdu sentence boundary disambiguation comprising a unigram statistical model and a rule-based algorithm. After implementing this approach, we obtained 99.48% precision, 86.3...
Publication date: 2010